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ABSTRACT 

We present a new implementation of the Floyd-Warshall All- 
Pairs Shortest Paths algorithm on CUDA[j Our algorithm 
runs approximately 5 times faster than the previously best 
reported algorithm. In order to achieve this speedup, we 
applied a new technique to reduce usage of on-chip shared 
memory and allow the CUDA scheduler to more effectively 
hide instruction latency. 

Categories and Subject Descriptors 

D.1.3 [Programming Techniques]: Concurrent Program- 
ming - Parallel Programming 

General Terms 

Performance 
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1. INTRODUCTION 

Graphics Processing Units are parallel processors offering 
high FLOPS/sec for low cost. NVIDIA's CUDA framework 
makes it practical to take advantage of this resource for 
high performance computing by providing a scalable pro- 
gramming model and a stable and comprehensible machine 
model to program against. Even with the tools provided by 
CUDA, making full use of the inherent processing power of 
a GPU can be a challenge, since the speed of a computation 
may not be limited by the speed of the processing units, but 
can also be limited by memory bus bandwidth or problems 
with thread scheduling. 

Algorithms that find the length of the shortest path between 
every pair of vertices in a weighted and directed graph solve 



^ After completion of this paper, the authors discovered that 
similar results were obtained by Bulug, Gilbert and Bu- 
dak[I]. 



the all pairs shortest paths problem. This class of algo- 
rithms, including the Floyd-Warshall algorithm (e.g. [2]), 
have a wide variety of applications in areas as diverse as 
bioinformatics, routing, and network analysis. The Floyd- 
Warshall algorithm is a standard approach for all pairs short- 
est paths problem that works with negative edge weights, 
doesn't suffer performance degradation for dense graphs, 
and has predictable execution regardless of the underlying 
data. It has can be decomposed into n"^ atomic tasks and 
runs in O(n^) time, where n is the number of vertices in the 
graph. 

Implementations of the Floyd-Warshall algorithm for CUDA 
have been presented in the past. Harish and Narayanan 
[3] presented an implementation assigning a single thread 
for each atomic task. This execution time of this approach 
was limited by the time necessary to pass data over the 
bus between global memory and the multiprocessors. Katz 
and Kider [4] achieved a speedup of 5 — 6x over Harish 
and Narayanan's simple implementation by using a blocked 
approach that performs several tasks for each data element 
passed over the global memory bus. It requires three 32 
by 32 tiles of data elements from the adjacency matrix to 
be stored in on chip shared memory for the duration of the 
execution of a thread block, which limits the flexibility of 
the GPU thread scheduler, resulting in significant exposed 
instruction latency. 

Our implementation achieves a further speedup of over 5x 
over the implementation of Katz and Kider. This includes a 
2.1 — 2.3 X speedup from reducing the instruction count and 
using less expensive instructions, and a 2.3 — 2.5 x speedup 
from reducing the shared memory required by each thread 
block, allowing the thread scheduler to more effectively hide 
latency. Our improvements to shared memory usage consist 
of making efficient use of the on chip registers and staging 
the algorithm so that not all shared data dependencies for a 
block need to be in shared memory at any one time. On the 
NVIDIA Tesla C1060, our implementation is able to solve 
APSP for any graph with single precision edge weights con- 
taining 16,384 vertices in 53.06 seconds^ This is over 150x 
as fast as a basic Floyd-Warshall implementation running 
on our CPU. 



This paper is organized as follows. In Section [2l we discuss 
previous work in optimizing the Floyd-Warshall algorithm. 



From this, one can see the implied constant (in seconds) for 
the run time of this 0(n'^) algorithm is around 1.2 • 10~^^. 



1: {Assume all Wxy are initialized to the weight of edge 

2: for all 1 s; fc ^ n do 

3: for all 1 ^ i ^ n do 

4: for all 1 s; J < n do 

5: Wij ^ ■min{wij,Wik + Wkj) 

6: end for 

7: end for 

8: end for 

Figure 1: The Floyd- Warshall Algorithm 

In Section [S] we discuss the blocked Floyd- Warshall algo- 
rithm used by Katz and Kider ([J) in more depth. We dis- 
cuss the improvements we made to the basic blocked Floyd- 
Warshall in Section [l] In Section O we present our results 
and show that they are close to the best possible on our 
hardware. 

2. PREVIOUS WORK 

An important improvement in cache utilization for the Floyd- 
Warshall algorithm was found by Venkataraman et al. [S]. 
(Ferencz and Diament presented a similar improvement a 
few years earlier in [6].) The method of Venkataraman et 
al. first partitions the incidence matrix into multiple tiles. 
From this, they noted that each tile of the matrix may be 
processed more extensively (i.e., one may iterate the "outer 
loop" multiple times over the tile) before proceeding to an- 
other tile. This approach allows for a significant improve- 
ment in cache utilization. They reported a speedup of be- 
tween 1.6x and 1.9x over the standard implementation on 
their targeted architecture. 

Further work on improving cache utilization was done by 
Penner and Prasanna jTj, and also Han et al. (8j. The work 
of Han et al. is particularly useful. They created a program 
that generates an implementation of the Floyd- Warshall al- 
gorithm that is optimized for a specific architecture, includ- 
ing the use of SIMD vectorization that is available on mod- 
ern PC processors. 

Bondhugula et al. [5] consider the efficient use of FPGAs 
(Field Programmable Gate Arrays) to improve performance 
of Floyd- Warshall. They noted (published in 2006) that 
their result was "an improvement by a factor of tens over 
the CPU implementation". 

Harish and Narayanan demonstrated the use of CUDA to 
solve APSP (along with several other graph algorithms) [3| . 
Soon after, a significant improvement was made by Katz and 
Kider Their implementation utilized the blocking algo- 
rithm described by Venkataraman et al. 5^ . Their approach 
resulted in a 5x-6.5x improvement over that of Harish and 
Narayanan. 

3. BLOCKED FLOYD-WARSHALL ALGO- 
RITHM ON CUDA 

3.1 The Floyd-Warshall Algorithm 

The Floyd-Warshall algorithm is a familiar algorithm to 
solve the all pairs shortest paths problem. Let G — {V, E) be 
a weighted directed graph, where V = {v\,V2, ■ ■ ■ ,Vn\. Let 



Wij be the weight assigned to the edge {vi,Vj). If {vi,Vj) ^ 
E, then Wi^j = oo, and for all Vi G V , Wi^i — 0. Otherwise, 
uJij £ R, such that no negative cycles exist. Let d^''^ be the 
shortest path from Vi to Vj that (only) passes through some 
subset of {vi,V2, ■ ■ ■ ,Vk}- The basic Floyd-Warshall algo- 

( k) 

rithm successively calculates dl J by iterating over k and 
updating Wij to be the shortest path from i to j using only 
vertices through k. (see Figure [1]) At completion, the in- 
cidence matrix W contains d["j for all i and j. 

Harish and Narayanan 3 implemented Floyd-Warshall on 
CUDA in the simplest and most direct way possible. They 
created a kernel containing a single iteration of the inner- 
most loop, and repeatedly assigned as many threads run- 
ning this kernel as possible. Since the GPU is designed to 
efficiently schedule a large number of threads, this approach 
works reasonably well and typically offers some speedup over 
the same algorithm on a CPU, but it suffers from global 
memory bus saturation. 

It is easy to see that the core algorithm requires 3 words to 
be sent from global memory to the multiprocessors for the 
calculation and 1 word to be sent back to global for stor- 
age. This is 16 bytes sent across the global memory bus for 
each task performed. On our setup, the measured speed for 
device-to-device memcpy operations for the NVIDIA Tesla 
CI 060 is 77 GB/sec. Based on this, we can calculate that we 
can expect to perform fewer than 4.8 * 10® tasks per second, 
and in fact our measured speed for this algorithm is ap- 
proximately 2.6* 10® tasks/sec (see section [5}. On the other 
hand, the advertised computing speed for the NVIDIA Tesla 
CI 060 is 933 GFLOPs/sec. Even unoptimized, the basic 
calculation of finding the correct index to process, adding, 
and taking a minimum requires fewer than the 359 FLOPs 
the Tesla is capable of performing in the measured time for 
a single task. 

3.2 Blocked Floyd-Warshall 

Katz and Kider [4] recognized this problem and applied ex- 
isting research (e.g. [S]) on improving cache efficiency of 
the Floyd-Warshall algorithm to make practical use of the 
shared memory associated with each multiprocessor in the 
GPU (see [1^ and [11] for details on the CUDA machine and 
programing models). The algorithm they used is a blocked 
Floyd-Warshall that performs several tasks for each data el- 
ement transferred across the memory bus. Figure [2] shows a 
basic sequential blocked Floyd-Warshall algorithm. 

In the blocked algorithm, the matrix is partitioned into tiles 
and the tasks are performed in stages. Katz and Kider use 
32 X 32 tiles, and 32 tasks are performed for each data ele- 
ment in the matrix during each stage of the algorithm. Each 
stage requires the execution of three kernels. The first kernel 
calculates 32 tasks for each data element in a single tile on 
the diagonal of the adjacency matrix, termed the "indepen- 
dent block." Tasks in the independent block depend only on 
other tasks in the independent block, or on tasks that were 
computed in a previous stage. The second kernel calculates 
32 tasks for each data element in each tile aligned with the 
independent block in either the i- or j- direction. These 
are the "singly dependent blocks." Tasks in the singly de- 
pendent blocks have one data dependency within the block. 



and one dependency in the already calculated independent 
block. There are Q{n) singly dependent blocks in each stage. 
Finally, the third kernel calculates 32 tasks for each of the 
remaining tiles. These are the "doubly dependent blocks," 
and each of these blocks depends entirely on two singly de- 
pendent blocks. Since there are no dependencies internal 
to a doubly dependent block, these tasks may be performed 
in any order. There are Q{n^) doubly dependent blocks in 
each stage, and it is the efficiency with which this stage is 
performed that determines the speed of the algorithm. 

In Katz and Kider's implementation of the doubly depen- 
dent kernel for this algorithm, two singly dependent tiles 
containing dependencies and the doubly dependent tile be- 
ing processed are loaded into shared memory for each thread 
block. The thread block then performs 32 tasks for each 
data element in the doubly dependent tile before copying 
the doubly dependent tile back to global memory. Clearly, 
the total data sent through the global memory bus is reduced 
by a factor of 32 in comparison with the implementation of 
Harish and Narayanan. Performing a similar bandwidth cal- 
culation as before, we would expect that, if the speed of the 
algorithm were limited by the memory bus, it would be ca- 
pable of performing 154 * 10^ tasks/second on the NVIDIA 
Tesla CI 060 in the ideal case, or perhaps half that if the 
performance were comparable to the previous algorithm. In 
fact, our results (see section [S| indicate that this algorithm 
can perform 14.9 * 10^ tasks/second, which indicates that 
it is limited by the rate it performs the tasks on the multi- 
processors and not by the rate it can send data between the 
multiprocessors and global memory. This is not a surprise, 
since, in the ideal case, we would need to be able to perform 
a task with just 6 FLOPs before bandwidth would be the 
determining factor. 



3.3 Thread Blocks and Multiprocessors in the 
Blocked Floyd- Warshall 

Each multiprocessor on a device of compute capability 1.3 
Uke the NVIDIA Tesla CI 060 can manage up to 1024 threads 
and has 16384 bytes of shared memory and 16384 one-word 
registers. A program will assign a large number of thread 
blocks, and multiple thread blocks may be processed simul- 
taneously on each multiprocessor, subject to the limitations 
just mentioned. 

The blocked Floyd- Warshall algorithm described above uses 
a tile size of 32. Each thread block processing a doubly de- 
pendent block copies three tiles to shared memory - the dou- 
bly dependent tile to process and two singly dependent tiles 
with the dependencies for doubly dependent tile. The to- 
tal shared memory consumed per block is 3 tiles *32^ words 
per tile *4 bytes per word -1-32 bytes for parameters = 12320 
bytes. Since this is more than half of the total shared mem- 
ory available per multiprocessor, only one block can be as- 
signed to a multiprocessor at any one time. 

196 threads assigned to a single multiprocessor can hide la- 
tency from register dependencies, and 512 threads is enough 
to hide latency of global memory access for most applications 
[12j . This assumes that a significant portion of the threads 
to be scheduled are available when needed. If a thread is 
waiting at a synchronization point, it can't be scheduled. 



1: {Assume all initialized to the weight of edge 

2: for all s; b < n/s do 

3: {Process the independent block first.} 

4: for all b*s^k<{b+l)*s do 

5: for all & * s ^ i < {b+ 1) * s do 

6: for all b * s ^ j < {b + 1) * s do 

7: Wij ^ min{wij,Wik + Wkj) 

8: end for 

9; end for 

10: end for 

11: {Tasks in a singly dependent block depend on the 

independent block.} 

12: {i-aligned singly dependent blocks} 

13: for all < ib < n/s do 

14: for all b * s < fc < (b -f 1) * s do 

15: for all b*s^i<{b+l)*s do 

16: for all ib * s !^ j < {ib + 1) * s do 

17: Wij ^ min{wij,Wik + Wkj) 

18: end for 

19: end for 

20: end for 

21: end for 

22: {j-aligned singly dependent blocks } 

23: for all O^jb^n/s do 

24: for all b * s < (b -f 1) * s do 

25: for all jb * s ^ i < {jb + 1) * s do 

26: for all b * s ^ j < (b + 1) * s do 

27: Wij -(^ min{wij,Wik + Wkj) 

28: end for 

29: end for 

30: end for 

31: end for 

32: {Most blocks are doubly dependent. Notice that k is 

now innermost.} 

33: for all ^ ib ^ n/s do 

34: for all ^ jb sC n/s do 

35: for all jb * s ^ i < {jb + 1) * s do 

36: for all ib * s ^ j < {ib + 1) * s do 

37: for all block * size ^ k < {block + 1) * size 
do 

38: Wij -h- min{wij , Wik + Wkj) 

39: end for 

40: end for 

41: end for 

42: end for 

43: end for 

44: end for 

Figure 2: Blocked Floyd- Warshall Algorithm 



In this application, the kernel first copies part of the tiles it 
needs from global memory, and then waits at a synchroniza- 
tion point. Since only one block can be assigned to each mul- 
tiprocessor, when the last thread to copy data from global 
memory executes its instruction, there are no threads avail- 
able to schedule. Accessing global memory has significant 
latency (hundreds of cycles) which will be exposed when the 
scheduler is starved for threads. 

4. STAGED BLOCKED FLOYD-WARSHALL 

We applied two rounds of improvements to a blocked im- 
plementation matching the performance reported in Katz 
and Kider. First, we performed standard optimizations to 
reduce the instruction count and use less expensive instruc- 
tions. Primarily, this was using bit shifts instead of division 
or modulus operators where possible and unrolling loops. 
These basic optimizations yielded a speedup of 2.1 — 2.3 x 

Once we had a well optimized basic blocked implementation, 
we applied two techniques to reduce the shared memory con- 
sumed by each thread block. First, we made more efficient 
use of the multiprocessor registers. Second, we staged the 
data load so that only a fraction of the singly dependent tiles 
containing the calculated dependencies for the current dou- 
bly dependent tile being processed were in shared memory 
at any one time. The net effect of these optimizations was to 
reduce the shared memory used by a thread block by a factor 
of nearly 12 without changing the amount of data flowing 
between global memory and the multiprocessors. Enabling 
multiple thread blocks to be assigned to each multiprocessor 
enables the thread scheduler to effectively hide the latency 
of accessing global memory and allows more tasks to be as- 
signed to each thread which reduces the computational over- 
head of calculating initial indexes. We changed from using 
256 threads per thread block to 64. These changes resulted 
in an additional 2.3 — 2.4 x speedup, for a total improvement 
of approximately 5.2 x. 

4.1 Efficient Use of Registers 

One critical observation is that each multiprocessor has more 
memory available in registers than in shared memory. For 
a device of compute capability 1.3, there are 16384 4-byte 
registers and a total of 16384 bytes of shared memory. By 
reading data that does not need to be shared among threads 
in a thread block directly to registers, the total shared mem- 
ory consumed can be reduced. This optimization will be 
useful when thread occupancy or the number of blocks as- 
signed to a multiprocessor is limited by shared memory and 
additional registers are still available. 

In this case, all of the tasks that involve updating a specific 
data element in the doubly dependent matrix are assigned 
to a single thread, and no thread needs to access the data 
elements assigned to another thread. As a result, the doubly 
dependent tile can be read directly into registers. Instead of 
reading the doubly dependent tile into an array of size t *t 
in shared memory, each thread reads its assigned elements 
into a local array of size t * t/h where h is the number of 
threads per thread block. In general, the compiler will create 
an array with constant indexes in registers. An array that 
doesn't have size determined at compile time will be created 
in local memory, which is vastly slower. 



Figure [3] shows the memory layout in shared memory when 
the doubly dependent tile is stored in the registers. This op- 
timization reduces the shared memory consumed per block 
to 2 * 32^ -I- 32 = 8224, which is more than half of the avail- 
able 16384. It is still only possible to assign a single thread 
block to each multiprocessor, so this improvement does not 
by itself affect performance. 

4.2 Staged Load of Singly Dependent Tiles 

To further reduce shared memory usage, we read the singly 
dependent blocks in stages. All of the tasks in the doubly 
dependent block in a certain fc-range will depend a specific 
subset of the data elements in the singly dependent tiles. 
Specifically, tasks in the range (i, j)k to {i, j)k+m will depend 
on data elements in the i-aligned tile in the range (i, k) to 
(j, k + m) and in the j-aligned matrix in the range (fe, j) to 
{k + m,j). Since the data elements of the doubly dependent 
tile are evenly assigned to the threads in a thread block, 
each thread will have the same number of tasks in a given 
fc-range. As a result, we can break the algorithm into stages, 
each separated by a synchronization point. In each stage, 
the data dependencies for the tasks in the next fc-range of 
size m are loaded into shared memory arrays, the threads are 
synchronized, and the next m tasks for each data element 
are performed. Using this approach, only t*m data elements 
for each singly dependent tile need to be loaded into shared 
memory at any one time. 

Figure |4] shows the usage of shared memory during a single 
stage of the update of the doubly dependent tile. In this 
case, the red areas of the i-aligned and j-aligned tiles contain 
all the dependencies for tasks in the doubly dependent tile 
over some range of k. When this range is updated, the next 
slice of each singly dependent tile will be loaded into shared 
memory, and the doubly dependent tile will be updated over 
the next k-range. 

Our specific implementation uses a tile size of 32 and stages 
the algorithm over 4 iterations. The total shared memory 
used in this implementation is 2 * 32 * 4 * 4 + 32 = 1056, so 
as many as 15 blocks could be run on each multiprocessor 
given the shared memory usage. The limiting factors are 
now the total threads assigned to a multiprocessor and the 
registers used for each thread block. 

4.3 Implementation 

There are two significant hurdles to implementing the staged 
load of the singly dependent tiles. First, it is necessary to 
retain coalesced access to global memory. A single half warp 
that reads from global memory should access aligned and 
contiguous words. Using a row-major data structure, a slice 
of the j-aligned tile will be composed of contiguous 16-word 
blocks but a slice of the i-aligned tile will not. 

Figure [S] shows this situation. A set of 4 adjacent columns 
from a row major matrix involves accessing only 4 adjacent 
data elements. By tiling the data in 4 by 4 blocks, we can 
take 4 rows or 4 columns with 16 or more adjacent elements 
in either direction without increasing the load on the global 
memory bus. 

In order to be able to read both a set of 4 columns and a 
set of 4 rows in contiguous 16-word blocks, we use a doubly 
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Figure 4: Shared Memory Usage with Doubly Dependent Tile Updated in Stages 



tiled, row-major data order similar to that used by Han, 
Franchetti, and Piischel [8]. In this ordering, each 32 by 32 
tile and each 4 by 4 tile is contiguous in memory. Within 
the entire matrix, the 32 by 32 tiles are arranged in row- 
major order, and within each 32 by 32 tile the 4 by 4 tiles 
are arranged in row-major order. 




Second, it is necessary to avoid shared memory bank con- 
flicts when accessing the singly dependent tiles. Shared 
memory is arranged into 16 banks. There are two patterns 
of memory access that allow a read from shared memory to 
be completed in a single cycle. Either all of the threads in a 
half warp must read the same data element, or each thread 
must read from a different memory bank. With all of the 
data elements assigned to a thread in a half warp adjacent 
in row-major order as they are in the Katz-Kider implemen- 
tation, the shared memory access is very naturally good. 
In the j-aligned tile, each thread accesses adjacent memory 
locations, and in the i-aligned tile, each thread in the half 
warp accesses the same memory location. When the data is 
arranged into 4 by 4 tiles, however, this simple memory ac- 
cess pattern breaks down. Threads 0, 4, 8, and 12 all access 
the same data element in the j-aligned tile and threads 0, 1, 
2, and 3 access the same element in the i-aligned tile, result- 
ing in 4-way data conflicts. This data access pattern causes 
each shared memory access to take 4 processor cycles. Since 
each task involves two reads from shared memory, this is a 
significant hit to performance. 



32 



Figure 5: Patterns of Global Memory Access for 
RoviT-Major and Tiled Data Orders 



To ameliorate this problem, it is important to remember 
that, although all of the tasks in a doubly dependent block 
must be performed, the order in which they are performed 
doesn't matter. As a result, different threads in a single 
half warp can process the tasks for their data elements in 
any order; they are not restricted to performing them in the 
most natural order, with k proceeding from to t. Instead, 
the first task performed by a thread in a given halfwarp is the 
task identified by the sum of the i and j indexes within the 
tile modulo 4. This results in conflict free shared memory 
access. 

Figure [6] shows different patterns of shared memory access 
that result from using a 4 by 4 tiled memory pattern. The 
top figure shows the memory access pattern when the ma- 
trix is in row-major order. In this case, the data elements 
processed by threads in a half warp are in a row 16 elements 
long. Each thread will access an element in the j-aligned 
tile with the same j-index as it has and k-index determined 
by the k-iteration it processes. Each thread will access the 
same element in the i-aligned tile, since they all have the 
same i-index. This results in a conflict free pattern of mem- 
ory access. Each element accessed in the j-aligned tile is in 
a different memory bank since they fall on 16 contiguous 
words. The single element in the i-aligned can be broadcast 
to all of the threads in a single cycle. 

The middle flgure shows what happens if a 4 by 4 tiled 
memory pattern is used with the same iteration over k as 
in the previous case. Since the threads in a half warp each 
have one of 4 j-indexes and one of 4 i-indexes, there are four 
different data elements in each tile that are accessed. This 
results in a 4 way data conflict, which uses at least 4 cycles 
per data access. Note that the same situation will arise if 
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Figure 6: Patterns of Shared Memory Access to Singly Dependent Tiles 



the data elements updated by threads in a half warp are 
arranged in a row of 16. Although the same indexes will 
be accessed as in the case using a row-major data ordering, 
these data elements will share data banks. For example, the 
data elements in the j-aligned tile having relative indexes 
(0, 0) and (4, 0) will be in the same bank. 

The bottom figure shows the approach we used to solve this 
problem. Instead of iterating over tasks {i,j)k to {i,j)k+4, in 
order, the task chosen depends on the position of the thread 
in it's half warp as described above. This results in conflict 
free shared memory data access. 



5. RESULTS AND ANALYSIS 

The tests were run on a computer having 1300 MHz AMD 
Phenom™ 9950 Quad-core processors and NVIDIA Tesla 
CI 060 GPUs. No program used more than 1 processor or 
GPU. All programs were compiled as 32-bit applications. 

Table [S] and figure [7] show the performance of our various 
implementations on different problem sizes. 

The CPU implementation is a basic implementation on the 
CPU. The Harish and Narayanan algorithm assigns a single 
thread for each task performed. The Katz and Kider algo- 
rithm overcomes the bandwidth limitations faced by Har- 
ish and Narayanan with blocking. The blocking size used 
is 32. The Optimized and Blocked algorithm is a version 
of the Katz and Kider algorithm that applies a number of 
optimizations to reduce the amount of time spent calculat- 
ing results. The Staged Load algorithm reduces the latency 
exposed in the Katz and Kider algorithm by reducing the 
amount of shared memory used. 

The Harish and Narayanan implementation is limited by 
the bandwidth of the global memory bus. Since each task 
requires 16 bytes of total traffic over the bus, the bandwidth 
achieved by this algorithm is 42 GB/sec, somewhat short 
of the 77 GB/sec bandwidth measured on our device for 
device-to-device memcpy. 

The Katz and Kider implementation is limited by the time 
spent on calculations. It can perform 14.9 * 10® tasks per 
second. Based on the advertised speed of 933 GFLOP/sec 
for a NVIDIA Tesla C1060, this is equivalent to using 62.7 
FLOPs for each task. Since each basic task consists of a sin- 
gle add, a minimum, and the accessory operations of locating 
and loading at least one dependency in shared memory and 
assigning the result of the calculation to a register, we could 
reasonably hope to get the actual total processing to under 
10 FLOPs per task. 

Our staged load implementation performs approximately 73.6* 
10® tasks per second. If it is limited by bandwidth, it 
achieves 46 GB/sec, which is less than the 70 GB/sec or 
so we could reasonably hope for. If it is limited by the 
processing speed, it is using the equivalent of 12.7 FLOPs 
per task. Since small changes to the code executed on the 
multiprocessor affect performance, we believe that it is still 
limited by the processing speed. In either case, it is possible 
that further optimizations could improve the performance 
of this blocked algorithm by a factor of 1.5 or less. 
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Table 1: Implementation Comparison Times 





Times 


Vertices 


CPU 


Harish & Narayanan 


Katz & Kider 


Optimized & Blocked 


Staged Load 


1024 


2.405 


0.408 


0.108 


0.0428 


0.0274 


2048 


18.38 


3.212 


0.65 


0.282 


0.14 


3072 


62.04 


10.99 


2.01 


0.653 


0.401 


4096 


145.2 


26.05 


4.62 


2.06 


0.934 


5120 




50.87 


8.84 


4.02 


1.76 


6I44 




87.9 


15.09 


6.89 


2.98 


7168 






23.82 


10.9 


4.65 


8192 




208.6 


35.37 


16.39 


6.88 


9216 






50.24 


23.05 


9.71 


10240 






68.67 


31.52 


13.22 


11264 






91.08 


41.82 


17.48 


12288 








54.05 


22.67 


13312 








68.56 


28.63 


14336 








85.56 


36.7 


15360 










43.74 


16384 






277.8 


126.9 


53.02 


17408 










63.4 




Figure 7: Graphs of Results 



